AITopics | efficient llm

Collaborating Authors

efficient llm

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SIRIUS : Contexual Sparisty with Correction for Efficient LLMs

Neural Information Processing SystemsDec-24-2025, 15:01:12 GMT

With the blossom of large language models (LLM), inference efficiency becomes increasingly important. Various approximate methods are proposed to reduce the cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a higher compression ratio seemingly without significant performance degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, it significantly degrades the model performance for reasoning, deduction, and knowledge-based tasks. Despite the gap in end-to-end accuracy, we observed that sparse models and original models often share the general problem-solving logic and require only a few token corrections to recover the original model performance. This paper introduces SIRIUS, an efficient correction mechanism, which significantly boosts CS models on reasoning tasks while maintaining its efficiency gain. SIRIUS is evaluated on 6 models with 8 difficult generation tasks in reasoning, deduction, and coding and shows consistent effectiveness and efficiency. Also, we carefully develop a system implementation for SIRIUS and show that SIRIUS delivers theoretical latency reduction with roughly a 20% reduction in latency for 8B model on-chip and a 35% reduction in latency for 70B model offloading.

contexual sparisty, large language model, natural language, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)

Add feedback

SIRIUS : Contexual Sparisty with Correction for Efficient LLMs

Neural Information Processing SystemsMay-26-2025, 19:44:05 GMT

artificial intelligence, large language model, natural language, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)

Add feedback

Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications

Behdin, Kayhan, Dai, Yun, Fatahibaarzi, Ata, Gupta, Aman, Song, Qingquan, Tang, Shao, Sang, Hejian, Dexter, Gregory, Zhu, Sirou, Zhu, Siyu, Dharamsi, Tejas, Sanjabi, Maziar, Kothapalli, Vignesh, Firooz, Hamed, Fu, Zhoutong, Cao, Yihan, Hsu, Pin-Lun, Borisyuk, Fedor, Wang, Zhipeng, Mazumder, Rahul, Pillai, Natesh, Simon, Luke

arXiv.org Artificial IntelligenceFeb-20-2025

Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendations to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present methods and insights for training small language models (SLMs) that deliver high performance and efficiency in deployment. We focus on two key techniques: (1) knowledge distillation and (2) model compression via quantization and pruning. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training, serving costs, and latency. We detail the impact of these techniques on a variety of use cases at a large professional social network platform and share deployment lessons - including hardware optimization strategies that enhance speed and throughput for both predictive and reasoning-based applications.

arxiv preprint arxiv, distillation, pruning, (13 more...)

arXiv.org Artificial Intelligence

2502.14305

Country:

North America > United States > California > Santa Clara County > Sunnyvale (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Sirius: Contextual Sparsity with Correction for Efficient LLMs

Zhou, Yang, Chen, Zhuoming, Xu, Zhaozhuo, Lin, Victoria, Chen, Beidi

arXiv.org Artificial IntelligenceSep-5-2024

With the blossom of large language models (LLMs), inference efficiency becomes increasingly important. Various approximation methods are proposed to reduce the cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a higher compression ratio seemingly without quality degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, CS significantly degrades the model performance for reasoning, deduction, and knowledge-based tasks. Despite the gap in end-to-end accuracy, we observed that sparse models often share general problem-solving logic and require only a few token corrections to recover the original model performance. This paper introduces Sirius, an efficient correction mechanism, which significantly recovers CS models quality on reasoning tasks while maintaining its efficiency gain. Sirius is evaluated on 6 models with 8 difficult generation tasks in reasoning, math, and coding and shows consistent effectiveness and efficiency. Also, we carefully develop a system implementation for Sirius and show that Sirius achieves roughly 20% reduction in latency for 8B model on-chip and 35% reduction for 70B model offloading. We open-source our implementation of Sirius at https://github.com/Infini-AI-Lab/Sirius.git.

contextual sparsity, correction, efficient llm, (1 more...)

arXiv.org Artificial Intelligence

2409.03856

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

Add feedback

Scaling Efficient LLMs

Kausik, B. N.

arXiv.org Artificial IntelligenceFeb-22-2024

Trained LLMs are typically sparse in that most of the parameters are zero, raising questions on efficiency. In response, we inquire into efficient LLMs, i.e. those with the fewest parameters that achieve the desired accuracy on a training corpus. Specifically, we compare theoretical and empirical estimates for training loss at current scale to obtain upper and lower bounds on the number of unique sequences in a natural training corpus as a function of its size. Our result implies (1) to double the number of skills represented in a training corpus, the corpus must scale roughly between three and five fold (2) for efficient LLMs, the number of parameters $N$ and the size $D$ of a natural training corpus scale as $N \sim D^{0.58}$ (3) if the number of parameters of an LLM is smaller than the number of unique sequences in the training corpus, scaling up can uncover emergent skills.

llm, sequence, training corpus, (13 more...)

arXiv.org Artificial Intelligence

2402.14746

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report > New Finding (0.36)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

Liu, Zichang, Wang, Jue, Dao, Tri, Zhou, Tianyi, Yuan, Binhang, Song, Zhao, Shrivastava, Anshumali, Zhang, Ce, Tian, Yuandong, Re, Christopher, Chen, Beidi

arXiv.org Artificial IntelligenceOct-26-2023

Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, have to forgo LLM's in-context learning ability, or do not yield wall-clock time speedup on modern hardware. We hypothesize that contextual sparsity, which are small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model for a given input, can address these issues. We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising LLM's quality or in-context learning ability. Based on these insights, we propose DejaVu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. We validate that DejaVu can reduce the inference latency of OPT-175B by over 2X compared to the state-of-the-art FasterTransformer, and over 6X compared to the widely used Hugging Face implementation, without compromising model quality. The code is available at https://github.com/FMInference/DejaVu.

contextual sparsity, sparsity, step follow, (12 more...)

arXiv.org Artificial Intelligence

2310.17157

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(7 more...)

Genre:

Workflow (0.93)
Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback